
14 research outputs found

Sentence alignment in DPC: maximizing precision, minimizing human effort

Author: Macken Lieve
Paulussen Hans
Trushkina Julia
Publication venue: European Language Resources Association (ELRA)
Publication date: 01/01/2008

A wide spectrum of multilingual applications requires aligned parallel corpora as a prerequisite. The aim of the project described in this paper is to build a multilingual corpus in which all sentences are aligned at very high precision with minimal human effort. Experiments on a combination of sentence aligners with different underlying algorithms showed that, by verifying only those links that were not recognized by at least two aligners, the error rate can be reduced by 93.76% compared to the performance of the best aligner. This manual verification concerned only a small portion of the data (6%), which significantly reduces the manual work needed to achieve nearly 100% alignment accuracy.
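The selection rule in this abstract (accept links proposed by at least two aligners, send the rest to a human) can be illustrated with a minimal sketch. The function name `split_links`, the link representation, and the three sample aligner outputs are hypothetical and do not come from the paper, which does not state how many aligners were combined or how links were encoded.

```python
from collections import Counter


def split_links(aligner_outputs, min_votes=2):
    """Partition proposed links into auto-accepted and to-be-verified sets.

    aligner_outputs: one iterable of links per aligner (hypothetical format).
    A link is accepted when at least `min_votes` aligners proposed it.
    """
    votes = Counter()
    for links in aligner_outputs:
        votes.update(set(links))  # each aligner votes at most once per link
    accepted = {link for link, n in votes.items() if n >= min_votes}
    to_verify = set(votes) - accepted
    return accepted, to_verify


# Hypothetical data: a link pairs source sentence ids with target sentence ids.
aligner_a = [((1,), (1,)), ((2,), (2, 3)), ((3,), (4,))]
aligner_b = [((1,), (1,)), ((2,), (2,)), ((3,), (4,))]
aligner_c = [((1,), (1,)), ((2,), (2, 3)), ((3,), (5,))]

accepted, to_verify = split_links([aligner_a, aligner_b, aligner_c])
```

In this toy input, three links receive two or more votes and are accepted automatically, while the two links proposed by a single aligner are queued for manual checking.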

Ghent University Academic Bibliography

Archivsystem Ask23

Recent developments in linguistic annotations of the TüBa-D/Z treebank

Author: Hinrichs Erhard
Kübler Sandra
Naumann Karin
Telljohann Heike
Trushkina Julia
Publication date: 01/01/2004

The purpose of this paper is to describe recent developments in the morphological, syntactic, and semantic annotation of the TüBa-D/Z treebank of German. The TüBa-D/Z annotation scheme is derived from the Verbmobil treebank of spoken German [4, 10], but has been extended along various dimensions to accommodate the characteristics of written texts. TüBa-D/Z uses as its data source the "die tageszeitung" (taz) newspaper corpus. The Verbmobil treebank annotation scheme distinguishes four levels of syntactic constituency: the lexical level, the phrasal level, the level of topological fields, and the clausal level. The primary ordering principle of a clause is the inventory of topological fields, which characterize the word order regularities among different clause types of German, and which are widely accepted among descriptive linguists of German [3, 6]. The TüBa-D/Z annotation relies on a context-free backbone (i.e. proper trees without crossing branches) of phrase structure combined with edge labels that specify the grammatical function of the phrase in question. The syntactic annotation scheme of the TüBa-D/Z is described in more detail in [12, 11]. TüBa-D/Z currently comprises approximately 15 000 sentences, with approximately 7 000 sentences being in the correction phase. The latter will be released along with an updated version of the existing treebank before the end of this year. The treebank is available in an XML format, in the NEGRA export format [1] and in the Penn treebank bracketing format. The XML format contains all types of information as described above, the NEGRA export format contains all sentence-internal information, while the Penn treebank format includes only those layers of information that can be expressed as pure tree structures. Over the course of the last year, more fine-grained linguistic annotations have been added along the following dimensions: 1. the basic Stuttgart-Tübingen tagset (STTS) [9] labels have been enriched by relevant features of inflectional morphology, 2. named entity information has been encoded as part of the syntactic annotation, and 3. a set of anaphoric and coreference relations has been added to link referentially dependent noun phrases. In the following sections, we will describe each of these innovations in turn and will demonstrate how the additional annotations can be incorporated into one comprehensive annotation scheme.

Hochschulschriftenserver - Universität Frankfurt am Main

Dutch parallel corpus: a multifunctional and multilingual corpus

Author: Desmet Piet
Macken Lieve
Paulussen Hans
Trushkina Julia
Vandeweghe Willy
Publication date: 01/01/2006

Ghent University Academic Bibliography

Dutch parallel corpus : a multilingual annotated corpus

Author: Desmet Piet
Macken Lieve
Paulussen Hans
Rura Lidia
Trushkina Julia
Vandeweghe Willy
Publication date: 01/01/2007

Ghent University Academic Bibliography

Morphosyntaktische Annotation und Dependenzparsing des Deutschen

Author: Trushkina Julia
Publication venue: Universität Tübingen
Publication date: 20/12/2004

The parsing of natural language relies on the syntactic characteristics of words. The part of speech category is one of the most common sources of information in parsing. In the parsing of highly inflectional languages, morphological information, such as case, number and gender, also plays an important role. It helps to resolve syntactic ambiguity in shallow parsing and is particularly useful in dependency parsing of languages with free word order, since it partly determines the argument structure of the sentence. For German, a highly inflectional language with partially free word order, the problem of assigning morpho-syntactic categories, such as part of speech, case, number, gender, person, tense and mood, i.e. the problem of morpho-syntactic annotation, is complicated by the high ambiguity inherent in tokens. Moreover, the partially paradigm-dependent case syncretism of this language makes the problem particularly intricate. This thesis is concerned with the automatic morpho-syntactic annotation of German. Different approaches to the task are investigated. A hybrid system is presented that combines the relative strengths of its rule-based and statistical modules. The rule-based module is based on the Xerox Incremental Deep Parsing System and provides a novel constraint-based framework that integrates phrase-internal concord rules and phrase-external syntactic heuristics into one uniform architecture. The rule-based module successfully reduces the candidate analyses provided by a morphological analyzer. The statistical module is based on a novel use of probabilistic phrase-structure grammars for morpho-syntactic annotation. The module resolves the remaining cases of ambiguity, providing unambiguous and highly accurate output. The usefulness of morpho-syntactic information is evaluated empirically in the creation of a dependency parser for German. The input to the parser is limited to tokens and their morpho-syntactic characteristics. The parser reaches state-of-the-art performance.

Publikationsserver der Universität Tübingen

Dutch Parallel Corpus: MT Corpus and Translator’s aid

Author: Julia Trushkina
Lidia Rura
Lieve Macken
Publication date: 01/01/2007

This paper reports on the development of the Dutch Parallel Corpus (DPC): a high-quality sentence-aligned parallel corpus of 10 million words for the language pairs Dutch-English and Dutch-French. The corpus is composed of different text types. All processing steps, including alignment and linguistic annotation, undergo quality control at different levels. Four categories of potential users of the DPC can be distinguished: developers of HLT applications, linguists conducting more fundamental research, human translators and language learners. This paper focuses on two types of intended users: MT developers and human translators. The paper describes different characteristics of the corpus relevant for such users, concentrating on corpus design, processing of the corpus data and the exploitation of the corpus.

CiteSeerX

Ghent University Academic Bibliography

Rule-based and Statistical Approaches to Morpho-syntactic Tagging of German

Author: Erhard W. Hinrichs
Julia S. Trushkina

Rule-based and statistical approaches constitute the two leading paradigms in computational linguistics. This paper applies the two types of approaches to the task of assigning morpho-syntactic categories to words in German, a language with rich inflectional morphology. The rule-based approach uses the Xerox Incremental Deep Parsing System and provides a novel constraint-based framework that integrates phrase-internal concord rules and phrase-external syntactic heuristics into one uniform architecture. The statistical approach utilizes the PCFG-parser LoPar, which yields acceptable results even for moderate amounts of manually annotated treebank training data. It is shown that tree transformations constitute a crucial step in weakening the independence assumptions inherent in probabilistic context-free grammars and in optimizing the performance for the task at hand.
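The abstract does not say which tree transformations were used. As an illustration only, the sketch below shows parent annotation, a widely known transformation that weakens PCFG independence assumptions by making each nonterminal label depend on its parent. The tuple-based tree encoding, the function name `annotate_parents`, and the sample sentence are assumptions made for this example, not details from the paper.

```python
def annotate_parents(tree, parent=None):
    """Return a copy of `tree` with each nonterminal label suffixed by its parent label.

    A tree is either a string (a word) or a tuple (label, child, child, ...).
    This encoding is a hypothetical choice for the example.
    """
    if isinstance(tree, str):  # terminal: leave the word unchanged
        return tree
    label, *children = tree
    new_label = label if parent is None else f"{label}^{parent}"
    # Children see the ORIGINAL label as their parent, not the annotated one.
    return (new_label, *[annotate_parents(child, label) for child in children])


# Hypothetical toy tree using STTS-style part-of-speech tags.
tree = ("S", ("NP", ("ART", "der"), ("NN", "Hund")), ("VP", ("VVFIN", "bellt")))
annotated = annotate_parents(tree)
```

After the transformation, a rule such as NP -> ART NN is replaced by NP^S -> ART^NP NN^NP, so a grammar read off the annotated trees can assign different probabilities to noun phrases in different syntactic contexts.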

CiteSeerX